DAMD hashtag co-occurrence graph (re)construction

Let's look at the shape of the data about DAMD and see how to computationally construct a graph, and how that compares to doing so with an interactive tool, such as Table 2 Net.

Reading the data from a file into Python

OK, we have received a nice data file. First we can take a look at it with Excel or a text editor. Assuming the file 20170718 hashtag_damd uncleaned.csv has been placed in the same directory as this notebook, we can also take a peek in Python.



In [2]:

    
# first we want some Python tools to make our lives easier

import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

%matplotlib inline



In [2]:

    
with open("20170718 hashtag_damd uncleaned.csv") as fd:
    for row in fd.readlines()[:3]:
        print(row)









    



"","tweet_id","user_id","user_name","reply_to_id","created","message","geodata","place_id","place_type","place_name","place_country","language","retweet_count","hashtags","user_mentions_name","user_mentions_id","urls","media_id","media_type","media_url"

"1","885401672448589824",43302304,"Motor Mavens","NULL","Thu Jul 13 07:33:03 +0000 2017","The @oemaudioplus #86Vantage's interior just looks so #DAMD upscale! And sounds upscale too. The crisp sound and... https://t.co/bUXNNPNHbQ","NULL","NULL","NULL","NULL","NULL","en",0,"86Vantage;DAMD","OEM AUDIO PLUS","137555927","http://fb.me/6IdLxl68T","NULL","NULL","NULL"

"2","772829925279752196",94512824,"Caspar de Kiefte","NULL","Mon Sep 05 16:13:07 +0000 2016","#DAMD -&gt; via Kunstenbond onderdeel van internationaal netwerk waaronder Directors Guild of America https://t.co/qGBOMakAQ8","NULL","NULL","NULL","NULL","NULL","nl",0,"DAMD","NULL","NULL","http://damd.nl/nieuws/damd-via-kunstenbond-verbonden-in-internationaal-netwerk/","NULL","NULL","NULL"

That looks like a comma-separated values (CSV) file. There are many other kinds of data files, but this one is quite typical. In a CSV, the first line names the columns, each following line is a data item (a tweet in this case), and the columns are the variables recorded for each item. Once loaded into pandas, such a table is called a data frame.
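To make the line-to-item mapping concrete, here is a minimal sketch using Python's standard csv module on a made-up two-line sample (the text in `sample` is illustrative and not taken from the data file). It also shows why the double quotes matter: a quoted field may itself contain commas.

```python
import csv
import io

# Hypothetical miniature of the file: a header line and one quoted record.
sample = (
    '"tweet_id","message","hashtags"\n'
    '"1","Crisp sound, upscale look","86Vantage;DAMD"\n'
)

# DictReader uses the first line as the variable (column) names.
rows = list(csv.DictReader(io.StringIO(sample)))

for row in rows:
    print(row["tweet_id"], "|", row["message"], "|", row["hashtags"])
```

The comma inside the quoted message does not start a new column, which is what we also rely on when pandas reads the real file.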



In [4]:

    
damd = pd.read_csv("20170718 hashtag_damd uncleaned.csv")

What variables do we have?



In [5]:

    
damd.columns









    Out[5]:





Index(['Unnamed: 0', 'tweet_id', 'user_id', 'user_name', 'reply_to_id',
       'created', 'message', 'geodata', 'place_id', 'place_type', 'place_name',
       'place_country', 'language', 'retweet_count', 'hashtags',
       'user_mentions_name', 'user_mentions_id', 'urls', 'media_id',
       'media_type', 'media_url'],
      dtype='object')

Let's use tweet_id as the index. It is a unique identifier for the tweets.



In [6]:

    
damd = pd.read_csv("20170718 hashtag_damd uncleaned.csv", index_col="tweet_id")
damd.head(3)









    Out[6]:







  
    
      
tweet_id	Unnamed: 0	user_id	user_name	reply_to_id	created	message	geodata	place_id	place_type	place_name	place_country	language	retweet_count	hashtags	user_mentions_name	user_mentions_id	urls	media_id	media_type	media_url
885401672448589824	1	43302304	Motor Mavens	NaN	Thu Jul 13 07:33:03 +0000 2017	The @oemaudioplus #86Vantage's interior just l...	NaN	NaN	NaN	NaN	NaN	en	0	86Vantage;DAMD	OEM AUDIO PLUS	137555927	http://fb.me/6IdLxl68T	NaN	NaN	NaN
772829925279752196	2	94512824	Caspar de Kiefte	NaN	Mon Sep 05 16:13:07 +0000 2016	#DAMD -&gt; via Kunstenbond onderdeel van inte...	NaN	NaN	NaN	NaN	NaN	nl	0	DAMD	NaN	NaN	http://damd.nl/nieuws/damd-via-kunstenbond-ver...	NaN	NaN	NaN
828122222111764480	3	798400767975686144	Bec	NaN	Sun Feb 05 06:04:58 +0000 2017	@Budah96 @sarahbuya4 #Damd Olivia went and too...	NaN	NaN	NaN	NaN	NaN	en	0	Damd;Damd;Scandal;sogood	Spider-Paco The 🌮;Sarah	165599878;53990004	NaN	NaN	NaN	NaN
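Whether a column really is a unique identifier can be checked rather than assumed. Here is a small sketch on made-up rows (the frame `toy` and its values are hypothetical); on the real data the corresponding check would be `damd.index.is_unique`.

```python
import io
import pandas as pd

# Hypothetical three-tweet sample with the same two column names as the real file.
csv_text = "tweet_id,hashtags\n101,DAMD\n102,Damd;Scandal\n103,DAMD\n"
toy = pd.read_csv(io.StringIO(csv_text), index_col="tweet_id")

# True when no index value occurs twice.
unique_ids = toy.index.is_unique
# Any index values that repeat (empty list when the index is unique).
dupes = toy.index[toy.index.duplicated()].tolist()
print(unique_ids, dupes)
```

If duplicates did show up, they would be merged into a single tweet node later on, so it is worth knowing about them before building the graph.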

Hashtag co-occurrence graph creation

To find patterns in the data, we might look at the #hashtags and see whether anything interesting shows up in them. Co-occurrence, that is, which hashtags appear together in the same tweet, is a useful thing to look at, and it is easy to extract from Twitter data.

To analyze hashtag co-occurrence, we might want to build a bipartite graph ("network") $g = \langle N, V \rangle$ of tweets and hashtags, where $N = \{node_1, node_2, \ldots, node_n\}$ is a set of nodes ("spheres") and $V = \{\langle source, target \rangle_1, \langle source, target \rangle_2, \ldots, \langle source, target \rangle_m\}$ is a set of edges ("lines").

A bipartite graph has two types of nodes, which are not connected within the type, only across. In our case, hashtags are connected to tweets, but tweets are not directly connected to tweets, and hashtags are not directly connected to hashtags. Makes sense, right?
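As a rough illustration of the idea, the sketch below builds a tiny tweet-hashtag graph from made-up data (the dictionary `toy_tweets` is hypothetical), checks that it is bipartite, and then projects it onto the hashtag side. The projection step is one possible way to get from the bipartite graph to hashtag-to-hashtag co-occurrence; it is not part of the workflow in this notebook.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical toy data: three tweets and the hashtags they carry.
toy_tweets = {
    "tweet1": ["damd", "subaru"],
    "tweet2": ["damd", "subaru", "s206"],
    "tweet3": ["damd"],
}

b = nx.Graph()
for tweet, tags in toy_tweets.items():
    for tag in tags:
        b.add_edge(tweet, tag)  # edges only ever join a tweet to a hashtag

is_two_mode = nx.is_bipartite(b)

# Project onto the hashtag side: two hashtags become linked when they share
# at least one tweet, and the edge weight counts the shared tweets.
hashtag_nodes = {tag for tags in toy_tweets.values() for tag in tags}
cooc = bipartite.weighted_projected_graph(b, hashtag_nodes)

print(is_two_mode)
print(cooc["damd"]["subaru"]["weight"])
```

In this toy case damd and subaru appear together in two tweets, so their projected edge gets weight 2.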

This data manipulation can be done with Table 2 Net. Doing it programmatically is an alternative route to the same result. We will use a Python library called NetworkX.

Below is a Gephi visualization of a graph made with Table 2 Net, coloured by node type (red for tweets, green for hashtags) and showing labels for the hashtag nodes with a degree of 15 or larger. We have used the ForceAtlas2 algorithm in Gephi for positioning the nodes. The central node, the hashtag damd, has been hidden because it carries no information.

First, let's take a peek at the shape of the hashtags, that is, how they are stored in the data we have received.



In [7]:

    
damd.hashtags.head()









    Out[7]:





tweet_id
885401672448589824                86Vantage;DAMD
772829925279752196                          DAMD
828122222111764480      Damd;Damd;Scandal;sogood
869614229619224576                          Damd
862237577822318592    S206;DAMD;SUBARU;TOPRACING
Name: hashtags, dtype: object

We see that the hashtags column is itself a semicolon-separated list, so our data is kind of three-dimensional. We need to split it up.

From reading the documentation, we know that nx.Graph.add_edge() takes the two endpoints of one edge, source and target, as its arguments. For each tweet, we generate a list of its hashtags, and then add those edges to the graph one by one. So, from the original data shape

tweet1 hashtag1;hashtag2;hashtag3
tweet2 hashtag9;hashtag4
.
.
.

We create an intermediary data shape for the inner loop of the function below

tweet1 hashtag1
tweet1 hashtag2
tweet1 hashtag3
tweet2 hashtag9
tweet2 hashtag4
.
.
.

This suits what the NetworkX API expects.
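The reshaping step can be sketched in plain Python, independent of pandas or NetworkX. The dictionary `wide` and the helper name `to_edge_list` are made up for illustration.

```python
def to_edge_list(wide):
    """Turn {tweet: "tag1;tag2"} into a list of (tweet, tag) pairs."""
    edges = []
    for tweet, tags in wide.items():
        # One semicolon-separated cell becomes one pair per hashtag.
        for tag in tags.split(";"):
            edges.append((tweet, tag))
    return edges


wide = {"tweet1": "hashtag1;hashtag2;hashtag3", "tweet2": "hashtag9;hashtag4"}
edge_list = to_edge_list(wide)
for tweet, tag in edge_list:
    print(tweet, tag)
```

Each printed line corresponds to one row of the intermediary shape, and to one edge in the graph.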

Conveniently NetworkX automatically creates the nodes, so we don't have to think about them. How can it automatically know what the nodes are, if it only looks at links?
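One way to see the answer: every edge names both of its endpoints, so the set of nodes can be collected from the edges alone. A minimal sketch with made-up labels:

```python
import networkx as nx

demo = nx.Graph()
demo.add_edge("tweet1", "hashtag1")
demo.add_edge("tweet1", "hashtag2")
demo.add_edge("tweet1", "hashtag1")  # repeated edge: nx.Graph stores it only once

# The nodes were never added explicitly, yet they are all there.
node_list = sorted(demo.nodes)
print(node_list, demo.number_of_edges())
```

The repeated edge matters for our data too: a tweet whose hashtags read Damd;Damd ends up with a single edge to that hashtag. A node without any edge, on the other hand, would never appear this way and would have to be added explicitly.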



In [8]:

    
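def buildHashtagCooccurrenceGraph(tweets):
    """Build a bipartite tweet-hashtag graph from a data frame of tweets."""
    g = nx.Graph(name="Hashtag co-occurrence bipartite")
    # split each "a;b;c" cell into the list ["a", "b", "c"]
    hashtag_lists = tweets.hashtags.astype(str).map(lambda l: l.split(';'))
    for tweet, hashtags in hashtag_lists.items():
        g.add_node(tweet, Type="tweet_id")  # mark tweet nodes by type
        for hashtag in hashtags:
            # lower-case so that DAMD, Damd and damd become one node
            g.add_edge(tweet, hashtag.lower())
    return g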



In [9]:

    
g = buildHashtagCooccurrenceGraph(damd)

Now, let's briefly inspect the graph g we created.



In [10]:

    
# nx.info() was removed in NetworkX 3.0; on newer versions use print(g) instead
print(nx.info(g))









    



Name: Hashtag co-occurrence bipartite
Type: Graph
Number of nodes: 2760
Number of edges: 4798
Average degree:   3.4768
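The same summary can be computed by hand, which may be useful if `nx.info` is not available in the installed NetworkX version. Below is a sketch on a made-up toy graph; the helper name `summarize` is hypothetical. The average degree of an undirected graph is twice the number of edges divided by the number of nodes, and for the figures above that gives 2 * 4798 / 2760 ≈ 3.4768.

```python
import networkx as nx


def summarize(graph):
    """Return (nodes, edges, average degree) for an undirected graph."""
    n = graph.number_of_nodes()
    m = graph.number_of_edges()
    # Each undirected edge contributes to the degree of two nodes.
    avg_degree = 2 * m / n if n else 0.0
    return n, m, avg_degree


# Toy graph built the same way as above: tweet nodes carry a Type attribute.
toy = nx.Graph()
toy.add_node("tweet1", Type="tweet_id")
toy.add_node("tweet2", Type="tweet_id")
toy.add_edges_from([("tweet1", "damd"), ("tweet1", "subaru"), ("tweet2", "damd")])

summary = summarize(toy)

# Hashtag nodes are the ones without the Type attribute set above.
hashtag_degrees = {
    node: toy.degree(node)
    for node, attrs in toy.nodes(data=True)
    if attrs.get("Type") != "tweet_id"
}
print(summary, hashtag_degrees)
```

A dictionary like `hashtag_degrees` could also be filtered by a threshold, in the spirit of the degree filter used for the labels in the Gephi visualization.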

Save to file, for opening in Gephi.



In [9]:

    
nx.write_gexf(g, "hashtag-cooccurrence-bipartite-with-python.gexf")
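Before switching over to Gephi, it can be reassuring to read the file back and see that nothing was lost. Here is a sketch of such a round trip on a toy graph, written to a temporary directory (the file name toy.gexf is made up):

```python
import os
import tempfile

import networkx as nx

toy = nx.Graph(name="toy bipartite")
toy.add_node("tweet1", Type="tweet_id")
toy.add_edges_from([("tweet1", "damd"), ("tweet1", "subaru")])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.gexf")  # hypothetical file name
    nx.write_gexf(toy, path)
    restored = nx.read_gexf(path)

# Compare sizes of the original and the graph read back from disk.
same_size = (
    restored.number_of_nodes() == toy.number_of_nodes()
    and restored.number_of_edges() == toy.number_of_edges()
)
print(same_size)
```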

Compare the results of graph creation with Table 2 Net and Python

Read in the graph made with Table 2 Net.



In [10]:

    
g_table2net = nx.read_gexf("hashtag-cooccurrence-bipartite-with-table2net.gexf")
print(nx.info(g_table2net))









    



Name: 
Type: Graph
Number of nodes: 2760
Number of edges: 4798
Average degree:   3.4768

After half an hour of poking around in Gephi (setting colours and filters, positioning with ForceAtlas2 and exporting an image), here is a visualization of the graph. It should match the one above, which was visualized from a graph constructed from the same data with Table 2 Net.

In graph theory, "isomorphism" (ἴσος isos "equal", and μορφή morphe "form" or "shape") means that two graphs have the same shape: their nodes can be matched one to one so that the edges line up. Why do we want to know this? We want to check whether we successfully reproduced the process that Table 2 Net carried out.



In [11]:

    
# This algorithm is fast but not conclusive: False rules isomorphism out,
# while True only means that the graphs could be isomorphic
nx.isomorphism.fast_could_be_isomorphic(g, g_table2net)









    Out[11]:





True
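To see why a True from a quick check is not a proof, consider the sketch below. It uses `nx.could_be_isomorphic`, a sibling of the function used above, together with the exact test `nx.is_isomorphic`, on two small made-up graphs: one ring of eight nodes and two separate rings of four nodes. Whether the exact test is practical on a graph the size of ours is not something this sketch settles.

```python
import networkx as nx

# One ring of eight nodes versus two separate rings of four nodes.
ring8 = nx.cycle_graph(8)
two_ring4 = nx.disjoint_union(nx.cycle_graph(4), nx.cycle_graph(4))

# Heuristic pre-check: compares node count and local properties only.
quick = nx.could_be_isomorphic(ring8, two_ring4)
# Full isomorphism test: conclusive, but can be slow on large graphs.
exact = nx.is_isomorphic(ring8, two_ring4)
print(quick, exact)
```

Both graphs have eight nodes, every node has degree two and no node sits in a triangle, so the quick check cannot tell them apart. The exact test can, because one graph is connected and the other is not.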

Did we "open the black box" of Table 2 Net and Gephi?
